Abstract


Because of their superior ability to preserve sequence information over time, Long Short-Term Memory (LSTM) networks, a type of recurrent neural network with a more complex computational unit, have obtained strong results on a variety of sequence modeling tasks. The only underlying LSTM structure that has been explored so far is a linear chain. However, natural language exhibits syntactic properties that would naturally combine words into phrases. We introduce the Tree-LSTM, a generalization of LSTMs to tree-structured network topologies. Tree-LSTMs outperform all existing systems and strong LSTM baselines on two tasks: predicting the semantic relatedness of two sentences (SemEval 2014, Task 1) and sentiment classification (Stanford Sentiment Treebank).

1 Introduction 


Most models for distributed representations of phrases and sentences (that is, models where real-valued vectors are used to represent meaning) fall into one of three classes: bag-of-words models, sequence models, and tree-structured models. In bag-of-words models, phrase and sentence representations are independent of word order; for example, they can be generated by averaging constituent word representations (Landauer and Dumais, 1997; Foltz et al., 1998). In contrast, sequence models construct sentence representations as an order-sensitive function of the sequence of tokens (Elman, 1990; Mikolov, 2012). Lastly, tree-structured models compose each phrase and sentence representation from its constituent subphrases according to a given syntactic structure over the sentence (Goller and Kuchler, 1996; Socher et al., 2011).
Figure 1: Top: A chain-structured LSTM network. Bottom: A tree-structured LSTM network with arbitrary branching factor.


Order-insensitive models are insufficient to fully capture the semantics of natural language due to their inability to account for differences in meaning as a result of differences in word order or syntactic structure (e.g., “cats climb trees” vs. “trees climb cats”). We therefore turn to order-sensitive sequential or tree-structured models. In particular, tree-structured models are a linguistically attractive option due to their relation to syntactic interpretations of sentence structure. A natural question, then, is the following: to what extent (if at all) can we do better with tree-structured models as opposed to sequential models for sentence representation? In this paper, we work towards addressing this question by directly comparing a type of sequential model that has recently been used to achieve state-of-the-art results in several NLP tasks against its tree-structured generalization.

Due to their capability for processing arbitrary-length sequences, recurrent neural networks (RNNs) are a natural choice for sequence modeling tasks. Recently, RNNs with Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997) have re-emerged as a popular architecture due to their representational power and effectiveness at capturing long-term dependencies. LSTM networks, which we review in Sec. 2, have been successfully applied to a variety of sequence modeling and prediction tasks, notably machine translation (Bahdanau et al., 2014; Sutskever et al., 2014), speech recognition (Graves et al., 2013), image caption generation (Vinyals et al., 2014), and program execution (Zaremba and Sutskever, 2014).


In this paper, we introduce a generalization of the standard LSTM architecture to tree-structured network topologies and show its superiority for representing sentence meaning over a sequential LSTM. While the standard LSTM composes its hidden state from the input at the current time step and the hidden state of the LSTM unit in the previous time step, the tree-structured LSTM, or Tree-LSTM, composes its state from an input vector and the hidden states of arbitrarily many child units. The standard LSTM can then be considered a special case of the Tree-LSTM where each internal node has exactly one child.

In our evaluations, we demonstrate the empirical strength of Tree-LSTMs as models for representing sentences. We evaluate the Tree-LSTM architecture on two tasks: semantic relatedness prediction on sentence pairs and sentiment classification of sentences drawn from movie reviews. Our experiments show that Tree-LSTMs outperform existing systems and sequential LSTM baselines on both tasks. Implementations of our models and experiments are available at https://github.com/stanfordnlp/treelstm.


2 Long Short-Term Memory Networks 
2.1 Overview 


Recurrent neural networks (RNNs) are able to process input sequences of arbitrary length via the recursive application of a transition function on a hidden state vector $h_t$. At each time step $t$, the hidden state $h_t$ is a function of the input vector $x_t$ that the network receives at time $t$ and its previous hidden state $h_{t-1}$. For example, the input vector $x_t$ could be a vector representation of the $t$-th word in a body of text (Elman, 1990; Mikolov, 2012). The hidden state $h_t \in \mathbb{R}^d$ can be interpreted as a $d$-dimensional distributed representation of the sequence of tokens observed up to time $t$.

Commonly, the RNN transition function is an affine transformation followed by a pointwise nonlinearity such as the hyperbolic tangent function:

\[ h_t = \tanh\left(W x_t + U h_{t-1} + b\right). \]


Unfortunately, a problem with RNNs with transition functions of this form is that during training, components of the gradient vector can grow or decay exponentially over long sequences (Hochreiter, 1998; Bengio et al., 1994). This problem with exploding or vanishing gradients makes it difficult for the RNN model to learn long-distance correlations in a sequence.

The LSTM architecture (Hochreiter and Schmidhuber, 1997) addresses this problem of learning long-term dependencies by introducing a memory cell that is able to preserve state over long periods of time. While numerous LSTM variants have been described, here we describe the version used by Zaremba and Sutskever (2014).

We define the LSTM unit at each time step $t$ to be a collection of vectors in $\mathbb{R}^d$: an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, a memory cell $c_t$ and a hidden state $h_t$. The entries of the gating vectors $i_t$, $f_t$ and $o_t$ are in $[0, 1]$. We refer to $d$ as the memory dimension of the LSTM.

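The LSTM transition equations are the following:

\begin{equation}
\begin{aligned}
i_t &= \sigma\left(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}\right), \\
f_t &= \sigma\left(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)}\right), \\
o_t &= \sigma\left(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)}\right), \\
u_t &= \tanh\left(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)}\right), \\
c_t &= i_t \odot u_t + f_t \odot c_{t-1}, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}
\tag{1}
\end{equation}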


where $x_t$ is the input at the current time step, $\sigma$ denotes the logistic sigmoid function and $\odot$ denotes elementwise multiplication. Intuitively, the forget gate controls the extent to which the previous memory cell is forgotten, the input gate controls how much each unit is updated, and the output gate controls the exposure of the internal memory state. The hidden state vector in an LSTM unit is therefore a gated, partial view of the state of the unit’s internal memory cell. Since the values of the gating variables vary for each vector element, the model can learn to represent information over multiple time scales.
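To make the transition equations concrete, the following is a minimal NumPy sketch of a single LSTM step (Eqs. 1) and of running it over a sequence. It is an illustration only, not the released implementation: the function names (`init_lstm_params`, `lstm_step`, `run_lstm`), the dictionary layout of the parameters, and the random initialization scale are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm_params(input_dim, mem_dim, rng):
    """Hypothetical initializer: W, U, b for the gates i, f, o and the update u."""
    params = {}
    for g in ("i", "f", "o", "u"):
        params["W" + g] = rng.normal(0.0, 0.1, (mem_dim, input_dim))
        params["U" + g] = rng.normal(0.0, 0.1, (mem_dim, mem_dim))
        params["b" + g] = np.zeros(mem_dim)
    return params

def lstm_step(params, x_t, h_prev, c_prev):
    """One application of Eqs. 1; returns (h_t, c_t)."""
    def pre(g):
        # Affine pre-activation W x_t + U h_{t-1} + b for gate g.
        return params["W" + g] @ x_t + params["U" + g] @ h_prev + params["b" + g]
    i_t = sigmoid(pre("i"))            # input gate
    f_t = sigmoid(pre("f"))            # forget gate
    o_t = sigmoid(pre("o"))            # output gate
    u_t = np.tanh(pre("u"))            # candidate update
    c_t = i_t * u_t + f_t * c_prev     # memory cell
    h_t = o_t * np.tanh(c_t)           # gated view of the memory cell
    return h_t, c_t

def run_lstm(params, xs, mem_dim):
    """Apply lstm_step over a sequence of input vectors, starting from zero state."""
    h = np.zeros(mem_dim)
    c = np.zeros(mem_dim)
    for x in xs:
        h, c = lstm_step(params, x, h, c)
    return h, c
```

Because $h_t$ is an output gate in $(0, 1)$ multiplied by a $\tanh$, every entry of the hidden state stays strictly inside $(-1, 1)$, while the memory cell $c_t$ is unbounded.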


2.2 Variants 

Two commonly-used variants of the basic LSTM 
architecture are the Bidirectional LSTM and the 
Multilayer LSTM (also known as the stacked or 
deep LSTM). 

Bidirectional LSTM. A Bidirectional LSTM (Graves et al., 2013) consists of two LSTMs that are run in parallel: one on the input sequence and the other on the reverse of the input sequence. At each time step, the hidden state of the Bidirectional LSTM is the concatenation of the forward and backward hidden states. This setup allows the hidden state to capture both past and future information.


Multilayer LSTM. In Multilayer LSTM architectures, the hidden state of an LSTM unit in layer $\ell$ is used as input to the LSTM unit in layer $\ell + 1$ in the same time step (Graves et al., 2013; Sutskever et al., 2014; Zaremba and Sutskever, 2014). Here, the idea is to let the higher layers capture longer-term dependencies of the input sequence.


These two variants can be combined as a Multilayer Bidirectional LSTM (Graves et al., 2013).


3 Tree-Structured LSTMs 


A limitation of the LSTM architectures described 
in the previous section is that they only allow for 
strictly sequential information propagation. Here, 
we propose two natural extensions to the basic 
LSTM architecture: the Child-Sum Tree-LSTM 
and the N-ary Tree-LSTM. Both variants allow for 
richer network topologies where each LSTM unit 
is able to incorporate information from multiple 
child units. 

As in standard LSTM units, each Tree-LSTM unit (indexed by $j$) contains input and output gates $i_j$ and $o_j$, a memory cell $c_j$ and hidden state $h_j$. The difference between the standard LSTM unit and Tree-LSTM units is that gating vectors and memory cell updates are dependent on the states of possibly many child units. Additionally, instead of a single forget gate, the Tree-LSTM unit contains one forget gate $f_{jk}$ for each child $k$. This allows the Tree-LSTM unit to selectively incorporate information from each child. For example, a Tree-LSTM model can learn to emphasize semantic heads in a semantic relatedness task, or it can learn to preserve the representation of sentiment-rich children for sentiment classification.

Figure 2: Composing the memory cell $c_1$ and hidden state $h_1$ of a Tree-LSTM unit with two children (subscripts 2 and 3). Labeled edges correspond to gating by the indicated gating vector, with dependencies omitted for compactness.

As with the standard LSTM, each Tree-LSTM unit takes an input vector $x_j$. In our applications, each $x_j$ is a vector representation of a word in a sentence. The input word at each node depends on the tree structure used for the network. For instance, in a Tree-LSTM over a dependency tree, each node in the tree takes the vector corresponding to the head word as input, whereas in a Tree-LSTM over a constituency tree, the leaf nodes take the corresponding word vectors as input.

3.1 Child-Sum Tree-LSTMs 

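Given a tree, let $C(j)$ denote the set of children of node $j$. The Child-Sum Tree-LSTM transition equations are the following:

\begin{align}
\tilde{h}_j &= \sum_{k \in C(j)} h_k, \tag{2} \\
i_j &= \sigma\left(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}\right), \tag{3} \\
f_{jk} &= \sigma\left(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\right), \tag{4} \\
o_j &= \sigma\left(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}\right), \tag{5} \\
u_j &= \tanh\left(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)}\right), \tag{6} \\
c_j &= i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k, \tag{7} \\
h_j &= o_j \odot \tanh(c_j), \tag{8}
\end{align}

where in Eq. 4, $k \in C(j)$.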
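As an illustration of Eqs. 2-8, here is a minimal NumPy sketch of one Child-Sum Tree-LSTM unit together with a recursive bottom-up encoder. The names `init_params`, `child_sum_node` and `encode`, the parameter dictionary, and the `(word_index, [subtrees])` tuple encoding of a tree are hypothetical choices made for this example and are not taken from the released code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(input_dim, mem_dim, rng):
    """Hypothetical initializer: one (W, U, b) triple per gate i, f, o and update u."""
    params = {}
    for g in ("i", "f", "o", "u"):
        params["W" + g] = rng.normal(0.0, 0.1, (mem_dim, input_dim))
        params["U" + g] = rng.normal(0.0, 0.1, (mem_dim, mem_dim))
        params["b" + g] = np.zeros(mem_dim)
    return params

def child_sum_node(params, x_j, child_h, child_c):
    """Eqs. 2-8 for one node; child_h and child_c are lists of child states."""
    mem_dim = params["bi"].shape[0]
    h_tilde = sum(child_h, np.zeros(mem_dim))                  # Eq. 2
    def pre(g, h):
        return params["W" + g] @ x_j + params["U" + g] @ h + params["b" + g]
    i_j = sigmoid(pre("i", h_tilde))                           # Eq. 3
    o_j = sigmoid(pre("o", h_tilde))                           # Eq. 5
    u_j = np.tanh(pre("u", h_tilde))                           # Eq. 6
    c_j = i_j * u_j                                            # Eq. 7, first term
    for h_k, c_k in zip(child_h, child_c):
        f_jk = sigmoid(pre("f", h_k))                          # Eq. 4, one gate per child
        c_j = c_j + f_jk * c_k                                 # Eq. 7, child terms
    h_j = o_j * np.tanh(c_j)                                   # Eq. 8
    return h_j, c_j

def encode(params, tree, vectors):
    """Bottom-up pass. tree = (word_index, [subtrees]); returns (h, c) at the root."""
    idx, children = tree
    states = [encode(params, t, vectors) for t in children]
    return child_sum_node(params, vectors[idx],
                          [s[0] for s in states], [s[1] for s in states])
```

Since the children enter only through sums, reordering the children of any node leaves the root state unchanged, and a leaf (no children) reduces to an LSTM step from a zero previous state.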


Intuitively, we can interpret each parameter matrix in these equations as encoding correlations between the component vectors of the Tree-LSTM unit, the input $x_j$, and the hidden states $h_k$ of the unit’s children. For example, in a dependency tree application, the model can learn parameters $W^{(i)}$ such that the components of the input gate $i_j$ have values close to 1 (i.e., “open”) when a semantically important content word (such as a verb) is given as input, and values close to 0 (i.e., “closed”) when the input is a relatively unimportant word (such as a determiner).

Dependency Tree-LSTMs. Since the Child-Sum Tree-LSTM unit conditions its components on the sum of child hidden states $h_k$, it is well-suited for trees with high branching factor or whose children are unordered. For example, it is a good choice for dependency trees, where the number of dependents of a head can be highly variable. We refer to a Child-Sum Tree-LSTM applied to a dependency tree as a Dependency Tree-LSTM.


3.2 N-ary Tree-LSTMs

The $N$-ary Tree-LSTM can be used on tree structures where the branching factor is at most $N$ and where children are ordered, i.e., they can be indexed from 1 to $N$. For any node $j$, write the hidden state and memory cell of its $k$th child as $h_{jk}$ and $c_{jk}$ respectively. The $N$-ary Tree-LSTM transition equations are the following:

\begin{align}
i_j &= \sigma\left(W^{(i)} x_j + \sum_{\ell=1}^{N} U^{(i)}_{\ell} h_{j\ell} + b^{(i)}\right), \tag{9} \\
f_{jk} &= \sigma\left(W^{(f)} x_j + \sum_{\ell=1}^{N} U^{(f)}_{k\ell} h_{j\ell} + b^{(f)}\right), \tag{10} \\
o_j &= \sigma\left(W^{(o)} x_j + \sum_{\ell=1}^{N} U^{(o)}_{\ell} h_{j\ell} + b^{(o)}\right), \tag{11} \\
u_j &= \tanh\left(W^{(u)} x_j + \sum_{\ell=1}^{N} U^{(u)}_{\ell} h_{j\ell} + b^{(u)}\right), \tag{12} \\
c_j &= i_j \odot u_j + \sum_{\ell=1}^{N} f_{j\ell} \odot c_{j\ell}, \tag{13} \\
h_j &= o_j \odot \tanh(c_j), \tag{14}
\end{align}

where in Eq. 10, $k = 1, 2, \ldots, N$. Note that when the tree is simply a chain, both Eqs. 2-8 and Eqs. 9-14 reduce to the standard LSTM transitions, Eqs. 1.
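The $N$-ary transitions (Eqs. 9-14) can be sketched in NumPy as follows. This is an assumption-laden illustration rather than the authors' code: `init_nary_params` and `nary_node` are hypothetical names, the per-child matrices are stored as stacked arrays, and missing children are assumed to be represented by zero rows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_nary_params(input_dim, mem_dim, n, rng):
    """Hypothetical initializer for an N-ary unit with N = n."""
    p = {"N": n}
    for g in ("i", "f", "o", "u"):
        p["W" + g] = rng.normal(0.0, 0.1, (mem_dim, input_dim))
        p["b" + g] = np.zeros(mem_dim)
    for g in ("i", "o", "u"):
        # One matrix U_l per child position l.
        p["U" + g] = rng.normal(0.0, 0.1, (n, mem_dim, mem_dim))
    # One matrix U_{kl} per (forget gate k, child l) pair, including off-diagonal ones.
    p["Uf"] = rng.normal(0.0, 0.1, (n, n, mem_dim, mem_dim))
    return p

def nary_node(p, x_j, child_h, child_c):
    """Eqs. 9-14. child_h, child_c: arrays of shape (N, mem_dim)."""
    def summed(U):
        # sum over l of U_l h_{jl}
        return np.einsum("lab,lb->a", U, child_h)
    i_j = sigmoid(p["Wi"] @ x_j + summed(p["Ui"]) + p["bi"])          # Eq. 9
    o_j = sigmoid(p["Wo"] @ x_j + summed(p["Uo"]) + p["bo"])          # Eq. 11
    u_j = np.tanh(p["Wu"] @ x_j + summed(p["Uu"]) + p["bu"])          # Eq. 12
    c_j = i_j * u_j
    for k in range(p["N"]):
        f_jk = sigmoid(p["Wf"] @ x_j + summed(p["Uf"][k]) + p["bf"])  # Eq. 10
        c_j = c_j + f_jk * child_c[k]                                 # Eq. 13
    h_j = o_j * np.tanh(c_j)                                          # Eq. 14
    return h_j, c_j
```

With `n = 1` the sums collapse to a single term and the unit computes exactly the standard LSTM step, which matches the remark that a chain-shaped tree recovers Eqs. 1.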

The introduction of separate parameter matrices for each child $k$ allows the $N$-ary Tree-LSTM model to learn more fine-grained conditioning on the states of a unit’s children than the Child-Sum Tree-LSTM. Consider, for example, a constituency tree application where the left child of a node corresponds to a noun phrase, and the right child to a verb phrase. Suppose that in this case it is advantageous to emphasize the verb phrase in the representation. Then the $U^{(f)}_{k\ell}$ parameters can be trained such that the components of $f_{j1}$ are close to 0 (i.e., “forget”), while the components of $f_{j2}$ are close to 1 (i.e., “preserve”).

Forget gate parameterization. In Eq. 10, we define a parameterization of the $k$th child’s forget gate $f_{jk}$ that contains “off-diagonal” parameter matrices $U^{(f)}_{k\ell}$, $k \neq \ell$. This parameterization allows for more flexible control of information propagation from child to parent. For example, this allows the left hidden state in a binary tree to have either an excitatory or inhibitory effect on the forget gate of the right child. However, for large values of $N$, these additional parameters are impractical and may be tied or fixed to zero.

Constituency Tree-LSTMs. We can naturally apply Binary Tree-LSTM units to binarized constituency trees since left and right child nodes are distinguished. We refer to this application of Binary Tree-LSTMs as a Constituency Tree-LSTM. Note that in Constituency Tree-LSTMs, a node $j$ receives an input vector $x_j$ only if it is a leaf node.

In the remainder of this paper, we focus on the special cases of Dependency Tree-LSTMs and Constituency Tree-LSTMs. These architectures are in fact closely related; since we consider only binarized constituency trees, the parameterizations of the two models are very similar. The key difference is in the application of the compositional parameters: dependent vs. head for Dependency Tree-LSTMs, and left child vs. right child for Constituency Tree-LSTMs.

4 Models 

We now describe two specific models that apply the Tree-LSTM architectures described in the previous section.

4.1 Tree-LSTM Classification 

In this setting, we wish to predict labels $\hat{y}$ from a discrete set of classes $\mathcal{Y}$ for some subset of nodes in a tree. For example, the label for a node in a parse tree could correspond to some property of the phrase spanned by that node.

At each node $j$, we use a softmax classifier to predict the label $\hat{y}_j$ given the inputs $\{x\}_j$ observed at nodes in the subtree rooted at $j$. The classifier takes the hidden state $h_j$ at the node as input:

\begin{align*}
\hat{p}_\theta(y \mid \{x\}_j) &= \mathrm{softmax}\left(W^{(s)} h_j + b^{(s)}\right), \\
\hat{y}_j &= \arg\max_y \, \hat{p}_\theta(y \mid \{x\}_j).
\end{align*}
y 

The cost function is the negative log-likelihood of the true class labels $y^{(k)}$ at each labeled node:

\[ J(\theta) = -\frac{1}{m} \sum_{k=1}^{m} \log \hat{p}_\theta\left(y^{(k)} \mid \{x\}^{(k)}\right) + \frac{\lambda}{2} \|\theta\|_2^2, \]

where $m$ is the number of labeled nodes in the training set, the superscript $k$ indicates the $k$th labeled node, and $\lambda$ is an L2 regularization hyperparameter.
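The classifier and its cost can be sketched as below. The helper names `predict` and `classification_cost` are hypothetical, and as a simplifying assumption the L2 term here covers only the classifier parameters $W^{(s)}$ and $b^{(s)}$, whereas $\theta$ in the cost above ranges over all model parameters.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict(W_s, b_s, h_j):
    """Class distribution and arg-max label for one node's hidden state."""
    p_hat = softmax(W_s @ h_j + b_s)
    return p_hat, int(np.argmax(p_hat))

def classification_cost(W_s, b_s, hidden_states, labels, lam):
    """Mean negative log-likelihood over labeled nodes plus (lam / 2) * ||params||^2."""
    m = len(labels)
    nll = 0.0
    for h_j, y in zip(hidden_states, labels):
        p_hat, _ = predict(W_s, b_s, h_j)
        nll -= np.log(p_hat[y])
    # Simplification: only the classifier parameters are regularized here.
    l2 = 0.5 * lam * (np.sum(W_s ** 2) + np.sum(b_s ** 2))
    return nll / m + l2
```

As a sanity check on the formula, with all-zero classifier parameters the predicted distribution is uniform, so the cost equals the logarithm of the number of classes regardless of the labels.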


4.2 Semantic Relatedness of Sentence Pairs

Given a sentence pair, we wish to predict a real-valued similarity score in some range $[1, K]$, where $K > 1$ is an integer. The sequence $\{1, 2, \ldots, K\}$ is some ordinal scale of similarity, where higher scores indicate greater degrees of similarity, and we allow real-valued scores to account for ground-truth ratings that are an average over the evaluations of several human annotators.

We first produce sentence representations $h_L$ and $h_R$ for each sentence in the pair using a Tree-LSTM model over each sentence’s parse tree. Given these sentence representations, we predict the similarity score $\hat{y}$ using a neural network that considers both the distance and angle between the pair $(h_L, h_R)$:

\begin{equation}
\begin{aligned}
h_\times &= h_L \odot h_R, \\
h_+ &= |h_L - h_R|, \\
h_s &= \sigma\left(W^{(\times)} h_\times + W^{(+)} h_+ + b^{(h)}\right), \\
\hat{p}_\theta &= \mathrm{softmax}\left(W^{(p)} h_s + b^{(p)}\right), \\
\hat{y} &= r^T \hat{p}_\theta,
\end{aligned}
\tag{15}
\end{equation}

where $r^T = [1 \; 2 \; \ldots \; K]$ and the absolute value function is applied elementwise. The use of both distance measures $h_\times$ and $h_+$ is empirically motivated: we find that the combination outperforms the use of either measure alone. The multiplicative measure $h_\times$ can be interpreted as an elementwise comparison of the signs of the input representations.

We want the expected rating under the predicted distribution $\hat{p}_\theta$ given model parameters $\theta$ to be close to the gold rating $y \in [1, K]$: $\hat{y} = r^T \hat{p}_\theta \approx y$. We therefore define a sparse target distribution$^1$ $p$ that satisfies $y = r^T p$:

\[
p_i =
\begin{cases}
y - \lfloor y \rfloor, & i = \lfloor y \rfloor + 1 \\
\lfloor y \rfloor - y + 1, & i = \lfloor y \rfloor \\
0 & \text{otherwise}
\end{cases}
\]

for $1 \le i \le K$. The cost function is the regularized KL-divergence between $p$ and $\hat{p}_\theta$:

\[ J(\theta) = \frac{1}{m} \sum_{k=1}^{m} \mathrm{KL}\left(p^{(k)} \,\middle\|\, \hat{p}_\theta^{(k)}\right) + \frac{\lambda}{2} \|\theta\|_2^2, \]

where $m$ is the number of training pairs and the superscript $k$ indicates the $k$th sentence pair.

5 Experiments


We evaluate our Tree-LSTM architectures on two tasks: (1) sentiment classification of sentences sampled from movie reviews and (2) predicting the semantic relatedness of sentence pairs.

In comparing our Tree-LSTMs against sequential LSTMs, we control for the number of LSTM parameters by varying the dimensionality of the hidden state.$^2$ Details for each model variant are summarized in Table 1.


5.1 Sentiment Classification 


In this task, we predict the sentiment of sentences sampled from movie reviews. We use the Stanford Sentiment Treebank (Socher et al., 2013). There are two subtasks: binary classification of sentences, and fine-grained classification over five classes: very negative, negative, neutral, positive, and very positive. We use the standard train/dev/test splits of 6920/872/1821 for the binary classification subtask and 8544/1101/2210 for the fine-grained classification subtask (there are fewer examples for the binary subtask since neutral sentences are excluded). Standard binarized constituency parse trees are provided for each sentence in the dataset, and each node in these trees is annotated with a sentiment label.

1 In the subsequent experiments, we found that optimizing this objective yielded better performance than a mean squared error objective.

2 For our Bidirectional LSTMs, the parameters of the forward and backward transition functions are shared. In our experiments, this achieved superior performance to Bidirectional LSTMs with untied weights and the same number of parameters (and therefore smaller hidden vector dimensionality).

                         Relatedness        Sentiment
LSTM Variant             d     |θ|          d     |θ|
Standard                 150   203,400      168   315,840
Bidirectional            150   203,400      168   315,840
2-layer                  108   203,472      120   318,720
Bidirectional 2-layer    108   203,472      120   318,720
Constituency Tree        142   205,190      150   316,800
Dependency Tree          150   203,400      168   315,840

Table 1: Memory dimensions $d$ and composition function parameter counts $|\theta|$ for each LSTM variant that we evaluate.

For the sequential LSTM baselines, we predict the sentiment of a phrase using the representation given by the final LSTM hidden state. The sequential LSTM models are trained on the spans corresponding to labeled nodes in the training set.

We use the classification model described in Sec. 4.1 with both Dependency Tree-LSTMs (Sec. 3.1) and Constituency Tree-LSTMs (Sec. 3.2). The Constituency Tree-LSTMs are structured according to the provided parse trees. For the Dependency Tree-LSTMs, we produce dependency parses$^3$ of each sentence; each node in a tree is given a sentiment label if its span matches a labeled span in the training set.

5.2 Semantic Relatedness 

For a given pair of sentences, the semantic relatedness task is to predict a human-generated rating of the similarity of the two sentences in meaning.

We use the Sentences Involving Compositional Knowledge (SICK) dataset (Marelli et al., 2014), consisting of 9927 sentence pairs in a 4500/500/4927 train/dev/test split. The sentences are derived from existing image and video description datasets. Each sentence pair is annotated with a relatedness score $y \in [1, 5]$, with 1 indicating that the two sentences are completely unrelated, and 5 indicating that the two sentences are very related. Each label is the average of 10 ratings assigned by different human annotators.
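Because the gold labels are averages and therefore generally non-integer, the similarity model of Sec. 4.2 is trained against a sparse target distribution rather than a single class. The sketch below shows one way to build that distribution for a rating in $[1, K]$ and to compute the KL-divergence term of the cost; the function names and the small `eps` guard are assumptions made for this example.

```python
import numpy as np

def target_distribution(y, K):
    """Sparse distribution p over {1, ..., K} with expected rating r^T p = y."""
    p = np.zeros(K)
    floor_y = int(np.floor(y))
    if floor_y == y:
        # Integer rating: all mass on a single class (also covers y == K).
        p[floor_y - 1] = 1.0
    else:
        p[floor_y - 1] = floor_y - y + 1   # mass on class floor(y)
        p[floor_y] = y - floor_y           # mass on class floor(y) + 1
    return p

def kl_divergence(p, p_hat, eps=1e-12):
    """KL(p || p_hat), summing only over classes where p is non-zero."""
    mask = p > 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(p_hat[mask] + eps))))
```

For example, a rating of 3.6 on the five-point SICK scale places mass 0.4 on class 3 and 0.6 on class 4, so that the expected rating under the target is again 3.6.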

Here, we use the similarity model described in Sec. 4.2. For the similarity prediction network (Eqs. 15) we use a hidden layer of size 50. We produce binarized constituency parses$^4$ and dependency parses of the sentences in the dataset for our Constituency Tree-LSTM and Dependency Tree-LSTM models.

3 Dependency parses produced by the Stanford Neural Network Dependency Parser (Chen and Manning, 2014).

Method                                    Fine-grained    Binary
RAE (Socher et al., 2013)                 43.2            82.4
MV-RNN (Socher et al., 2013)              44.4            82.9
RNTN (Socher et al., 2013)                45.7            85.4
DCNN (Blunsom et al., 2014)               48.5            86.8
Paragraph-Vec (Le and Mikolov, 2014)      48.7            87.8
CNN-non-static (Kim, 2014)                48.0            87.2
CNN-multichannel (Kim, 2014)              47.4            88.1
DRNN (Irsoy and Cardie, 2014)             49.8            86.6

LSTM                                      46.4 (1.1)      84.9 (0.6)
Bidirectional LSTM                        49.1 (1.0)      87.5 (0.5)
2-layer LSTM                              46.0 (1.3)      86.3 (0.6)
2-layer Bidirectional LSTM                48.5 (1.0)      87.2 (1.0)

Dependency Tree-LSTM                      48.4 (0.4)      85.7 (0.4)
Constituency Tree-LSTM
  - randomly initialized vectors          43.9 (0.6)      82.0 (0.5)
  - Glove vectors, fixed                  49.7 (0.4)      87.5 (0.8)
  - Glove vectors, tuned                  51.0 (0.5)      88.0 (0.3)

Table 2: Test set accuracies on the Stanford Sentiment Treebank. For our experiments, we report mean accuracies over 5 runs (standard deviations in parentheses). Fine-grained: 5-class sentiment classification. Binary: positive/negative sentiment classification.


5.3 Hyperparameters and Training Details 

The hyperparameters for our models were tuned 
on the development set for each task. 

We initialized our word representations using publicly available 300-dimensional Glove vectors$^5$ (Pennington et al., 2014). For the sentiment classification task, word representations were updated during training with a learning rate of 0.1. For the semantic relatedness task, word representations were held fixed as we did not observe any significant improvement when the representations were tuned.


Our models were trained using AdaGrad (Duchi et al., 2011) with a learning rate of 0.05 and a minibatch size of 25. The model parameters were regularized with a per-minibatch L2 regularization strength of $10^{-4}$. The sentiment classifier was additionally regularized using dropout (Hinton et al., 2012) with a dropout rate of 0.5. We did not observe performance gains using dropout on the semantic relatedness task.
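For readers unfamiliar with AdaGrad, the following sketch shows a single parameter update using the learning rate and L2 strength quoted above as defaults. It is a generic textbook form of the update under stated assumptions, not the training code used for the experiments: the function name, the `eps` stabilizer, and the choice to fold the L2 penalty into the gradient as `l2 * param` are all illustrative, and minibatching and dropout are not shown.

```python
import numpy as np

def adagrad_step(param, grad, hist, lr=0.05, l2=1e-4, eps=1e-8):
    """One AdaGrad update, applied in place to param and hist.

    param: parameter array; grad: minibatch gradient of the unregularized cost;
    hist: running sum of squared gradients (same shape as param).
    """
    g = grad + l2 * param                      # gradient of the (l2 / 2) * ||param||^2 term
    hist += g * g                              # accumulate squared gradients per coordinate
    param -= lr * g / (np.sqrt(hist) + eps)    # per-coordinate scaled step
    return param, hist
```

On the very first step each coordinate moves by approximately `lr` in the direction opposite to its gradient sign, independent of the gradient's magnitude; later steps shrink as the squared-gradient history grows.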


4 Constituency parses produced by the Stanford PCFG Parser (Klein and Manning, 2003).

5 Trained on 840 billion tokens of Common Crawl data, http://nlp.stanford.edu/projects/glove/
































































Method                                     Pearson’s r        Spearman’s ρ       MSE
Illinois-LH (Lai and Hockenmaier, 2014)    0.7993             0.7538             0.3692
UNAL-NLP (Jimenez et al., 2014)            0.8070             0.7489             0.3550
Meaning Factory (Bjerva et al., 2014)      0.8268             0.7721             0.3224
ECNU (Zhao et al., 2014)                   0.8414             -                  -

Mean vectors                               0.7577 (0.0013)    0.6738 (0.0027)    0.4557 (0.0090)
DT-RNN (Socher et al., 2014)               0.7923 (0.0070)    0.7319 (0.0071)    0.3822 (0.0137)
SDT-RNN (Socher et al., 2014)              0.7900 (0.0042)    0.7304 (0.0076)    0.3848 (0.0074)

LSTM                                       0.8528 (0.0031)    0.7911 (0.0059)    0.2831 (0.0092)
Bidirectional LSTM                         0.8567 (0.0028)    0.7966 (0.0053)    0.2736 (0.0063)
2-layer LSTM                               0.8515 (0.0066)    0.7896 (0.0088)    0.2838 (0.0150)
2-layer Bidirectional LSTM                 0.8558 (0.0014)    0.7965 (0.0018)    0.2762 (0.0020)

Constituency Tree-LSTM                     0.8582 (0.0038)    0.7966 (0.0053)    0.2734 (0.0108)
Dependency Tree-LSTM                       0.8676 (0.0030)    0.8083 (0.0042)    0.2532 (0.0052)


Table 3: Test set results on the SICK semantic relatedness subtask. For our experiments, we report mean 
scores over 5 runs (standard deviations in parentheses). Results are grouped as follows: (1) SemEval 
2014 submissions; (2) Our own baselines; (3) Sequential LSTMs; (4) Tree-structured LSTMs. 


6 Results 

6.1 Sentiment Classification 

Our results are summarized in Table 2. The Constituency Tree-LSTM outperforms existing systems on the fine-grained classification subtask and achieves accuracy comparable to the state-of-the-art on the binary subtask. In particular, we find that it outperforms the Dependency Tree-LSTM. This performance gap is at least partially attributable to the fact that the Dependency Tree-LSTM is trained on less data: about 150K labeled nodes vs. 319K for the Constituency Tree-LSTM. This difference is due to (1) the dependency representations containing fewer nodes than the corresponding constituency representations, and (2) the inability to match about 9% of the dependency nodes to a corresponding span in the training data.

We found that updating the word representations during training (“fine-tuning” the word embedding) yields a significant boost in performance on the fine-grained classification subtask and gives a minor gain on the binary classification subtask (this finding is consistent with previous work on this task by Kim (2014)). These gains are to be expected since the Glove vectors used to initialize our word representations were not originally trained to capture sentiment.


6.2 Semantic Relatedness

Our results are summarized in Table 3. Following Marelli et al. (2014), we use Pearson’s $r$, Spearman’s $\rho$ and mean squared error (MSE) as evaluation metrics. The first two metrics are measures of correlation against human evaluations of semantic relatedness.

We compare our models against a number of non-LSTM baselines. The mean vector baseline computes sentence representations as a mean of the representations of the constituent words. The DT-RNN and SDT-RNN models (Socher et al., 2014) both compose vector representations for the nodes in a dependency tree as a sum over affine-transformed child vectors, followed by a nonlinearity. The SDT-RNN is an extension of the DT-RNN that uses a separate transformation for each dependency relation. For each of our baselines, including the LSTM models, we use the similarity model described in Sec. 4.2.

We also compare against four of the top-performing systems$^6$ submitted to the SemEval 2014 semantic relatedness shared task: ECNU (Zhao et al., 2014), The Meaning Factory (Bjerva et al., 2014), UNAL-NLP (Jimenez et al., 2014), and Illinois-LH (Lai and Hockenmaier, 2014). These systems are heavily feature engineered, generally using a combination of surface form overlap features and lexical distance features derived from WordNet or the Paraphrase Database (Ganitkevitch et al., 2013).

6 We list the strongest results we were able to find for this task; in some cases, these results are stronger than the official performance by the team on the shared task. For example, the listed result by Zhao et al. (2014) is stronger than their submitted system’s Pearson correlation score of 0.8280.

Figure 3: Fine-grained sentiment classification accuracy vs. sentence length. For each $\ell$, we plot accuracy for the test set sentences with length in the window $[\ell - 2, \ell + 2]$. Examples in the tail of the length distribution are batched in the final window ($\ell = 45$).

Our LSTM models outperform all these systems without any additional feature engineering, with the best results achieved by the Dependency Tree-LSTM. Recall that in this task, both Tree-LSTM models only receive supervision at the root of the tree, in contrast to the sentiment classification task where supervision was also provided at the intermediate nodes. We conjecture that in this setting, the Dependency Tree-LSTM benefits from its more compact structure relative to the Constituency Tree-LSTM, in the sense that paths from input word vectors to the root of the tree are shorter on aggregate for the Dependency Tree-LSTM.
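The three evaluation metrics reported for this task (Pearson’s $r$, Spearman’s $\rho$ and MSE) can be computed from predicted and gold ratings as in the sketch below. The function names are hypothetical, and tied values are given their average rank, which is the usual convention for Spearman’s $\rho$ but is an assumption here since the tie-handling used in the experiments is not stated.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

def average_ranks(a):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    a = np.asarray(a, dtype=float)
    order = np.argsort(a, kind="mergesort")
    ranks = np.empty(len(a))
    ranks[order] = np.arange(1, len(a) + 1)
    for v in np.unique(a):
        tie = a == v
        ranks[tie] = ranks[tie].mean()
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson_r(average_ranks(x), average_ranks(y))

def mse(pred, gold):
    """Mean squared error between predictions and gold ratings."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    return float(np.mean((pred - gold) ** 2))
```

Note that the correlation metrics are insensitive to a constant offset in the predictions while MSE is not, which is why the three are reported together.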

7 Discussion and Qualitative Analysis 

7.1 Modeling Semantic Relatedness 

In Table 4, we list nearest-neighbor sentences retrieved from a 1000-sentence sample of the SICK test set. We compare the neighbors ranked by the Dependency Tree-LSTM model against a baseline ranking by cosine similarity of the mean word vectors for each sentence.

The Dependency Tree-LSTM model exhibits several desirable properties. Note that in the dependency parse of the second query sentence, the word “ocean” is the second-furthest word from the root (“waving”), with a depth of 4. Regardless, the retrieved sentences are all semantically related to the word “ocean”, which indicates that the Tree-LSTM is able to both preserve and emphasize information from relatively distant nodes. Additionally, the Tree-LSTM model shows greater robustness to differences in sentence length. Given the query “two men are playing guitar”, the Tree-LSTM associates the phrase “playing guitar” with the longer, related phrase “dancing and singing in front of a crowd” (note as well that there is zero token overlap between the two phrases).

Figure 4: Pearson correlations $r$ between predicted similarities and gold ratings vs. sentence length. For each $\ell$, we plot $r$ for the pairs with mean length in the window $[\ell - 2, \ell + 2]$. Examples in the tail of the length distribution are batched in the final window ($\ell = 18.5$).

7.2 Effect of Sentence Length 

One hypothesis to explain the empirical strength of Tree-LSTMs is that tree structures help mitigate the problem of preserving state over long sequences of words. If this were true, we would expect to see the greatest improvement over sequential LSTMs on longer sentences. In Figs. 3 and 4, we show the relationship between sentence length and performance as measured by the relevant task-specific metric. Each data point is a mean score over 5 runs, and error bars have been omitted for clarity.

We observe that while the Dependency Tree-LSTM does significantly outperform its sequential counterparts on the relatedness task for longer sentences of length 13 to 15 (Fig. 4), it also achieves consistently strong performance on shorter sentences. This suggests that unlike sequential LSTMs, Tree-LSTMs are able to encode semantically-useful structural information in the sentence representations that they compose.


Ranking by mean word vector cosine similarity (Score) | Ranking by Dependency Tree-LSTM model (Score)

Query: a woman is slicing potatoes
  a woman is cutting potatoes (0.96) | a woman is cutting potatoes (4.82)
  a woman is slicing herbs (0.92) | potatoes are being sliced by a woman (4.70)
  a woman is slicing tofu (0.92) | tofu is being sliced by a woman (4.39)

Query: a boy is waving at some young runners from the ocean
  a man and a boy are standing at the bottom of some stairs , which are outdoors (0.92) | a group of men is playing with a ball on the beach (3.79)
  a group of children in uniforms is standing at a gate and one is kissing the mother (0.90) | a young boy wearing a red swimsuit is jumping out of a blue kiddies pool (3.37)
  a group of children in uniforms is standing at a gate and there is no one kissing the mother (0.90) | the man is tossing a kid into the swimming pool that is near the ocean (3.19)

Query: two men are playing guitar
  some men are playing rugby (0.88) | the man is singing and playing the guitar (4.08)
  two men are talking (0.87) | the man is opening the guitar for donations and plays with the case (4.01)
  two dogs are playing with each other (0.87) | two men are dancing and singing in front of a crowd (4.00)

Table 4: Most similar sentences from a 1000-sentence sample drawn from the SICK test set. The Tree-LSTM model is able to pick up on more subtle relationships, such as that between “beach” and “ocean” in the second example.

8 Related Work

Distributed representations of words (Rumelhart et al., 1988; Collobert et al., 2011; Turian et al., 2010; Huang et al., 2012; Mikolov et al., 2013; Pennington et al., 2014) have found wide applicability in a variety of NLP tasks. Following this success, there has been substantial interest in the area of learning distributed phrase and sentence representations (Mitchell and Lapata, 2010; Yessenalina and Cardie, 2011; Grefenstette et al., 2013; Mikolov et al., 2013), as well as distributed representations of longer bodies of text such as paragraphs and documents (Srivastava et al., 2013; Le and Mikolov, 2014).


Our approach builds on recursive neural networks (Goller and Kuchler, 1996; Socher et al., 2011), which we abbreviate as Tree-RNNs in order to avoid confusion with recurrent neural networks. Under the Tree-RNN framework, the vector representation associated with each node of a tree is composed as a function of the vectors corresponding to the children of the node. The choice of composition function gives rise to numerous variants of this basic framework. Tree-RNNs have been used to parse images of natural scenes (Socher et al., 2011), compose phrase representations from word vectors (Socher et al., 2012), and classify the sentiment polarity of sentences (Socher et al., 2013).


9 Conclusion 


In this paper, we introduced a generalization of LSTMs to tree-structured network topologies. The Tree-LSTM architecture can be applied to trees with arbitrary branching factor. We demonstrated the effectiveness of the Tree-LSTM by applying the architecture in two tasks: semantic relatedness and sentiment classification, outperforming existing systems on both. Controlling for model dimensionality, we demonstrated that Tree-LSTM models are able to outperform their sequential counterparts. Our results suggest further lines of work in characterizing the role of structure in producing distributed representations of sentences.